# load libraries ----
# reporting
library(knitr)
# visualization
library(ggplot2)
library(ggthemes)
library(patchwork)
# data wrangling
library(dplyr)
library(tidyr)
library(readr)
library(skimr)
library(janitor)
library(magrittr)
library(lubridate)
# modeling
library(tidymodels)
library(vip)
# set other options ----
# scientific notation
options(scipen=999)
# turn off messages and warnings
knitr::opts_chunk$set(
tidy = FALSE,
message = FALSE,
warning = FALSE)
# set the ggplot theme for the document
theme_set(theme_bw())
# set random seed
set.seed(42)
26 Predict air pollution levels using a linear regression model
Learning Objectives
After completing this tutorial you should be able to
- outline and implement the minimum required steps to train a model using the tidymodels framework.
- understand the need for training and test data sets and implement tidymodels packages to split a data set into training and testing sets.
- understand the utility of cross-validation and implement it in the tidymodels framework.
Download the directory for this project here, make sure the directory is unzipped, and move it to your bi328 directory. You can open the Rproj for this module either by double clicking on it (which will launch RStudio), by opening RStudio and then using File > Open Project, or by clicking on the Rproject icon in the top right of your program window and selecting Open Project.
There should be a file named 26_linear-regression-models.qmd in that project directory. Use that file to work through this tutorial - you will hand in your rendered (“knitted”) quarto file as your homework assignment. So, first thing, change the author in the YAML header to your name. You will use this quarto document to record your answers. Remember to use comments to annotate your code; at minimum you should have one comment per code set1, though you may of course add as many comments as you need to be able to recall what you did. Similarly, take notes in the document as we discuss the discussion/reflection questions, but make sure that you go back and clean them up for “public consumption”.
1 You should do this whether you are adding code yourself or using code from our manual, even if it isn’t commented in the manual… especially when the code is already included for you, add comments to describe how the function works/what it does as we introduce it during the participatory coding session so you can refer back to it.
Before you get started, let’s make sure to read in all the packages we will need2.
2 Check that all of these packages are installed before you load them!
26.1 Building a machine learning model
We already have a data set with features and outcome variables that we’ve thoroughly explored. Let’s go ahead and read that data set in and make the transformations we concluded are needed so we are ready to start training a model.
# read & wrangle data set
pm <- read_delim("data/air_pollution.csv", delim = ",") %>%
clean_names() %>%
mutate(across(c(id, fips, zcta), as.factor)) %>%
mutate(city = case_when(city == "Not in a city" ~ "Not in a city",
city != "Not in a city" ~ "In a city"))
We are going to use the tidymodels ecosystem, which consists of a series of packages that assist in the various steps of the model building process. These packages were designed with a standard syntax and the goal of a standardized workflow across different types of machine learning algorithms. This also means that it is straightforward to modify pre-processing, algorithm choice, and hyper-parameter tuning for optimization.
These are the individual steps that we will go through to train our model3.
3 You are already familiar with some of these steps for some simple linear regressions that we have run (Steps 3, 5, 6, 7). Use this list to keep track of where we are in the process.
- Split data into testing and training sets.
- Create recipe + assign variable roles
- Specify model, engine, and mode
- Create workflow, add recipe & model
- Fit workflow
- Get predictions
- Use predictions to get performance metrics.
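The steps above can be previewed as one compact pipeline. The sketch below uses the built-in mtcars data set (not our pm data) so it runs on its own; all object names are illustrative, and we will build the pm version step by step in the rest of the chapter.

```r
# compact preview of the seven steps using the built-in mtcars data set;
# object names are illustrative, not part of this chapter's analysis
library(tidymodels)

set.seed(42)
# 1. split into training and testing sets
car_split <- initial_split(mtcars, prop = 2/3)
car_train <- training(car_split)
car_test <- testing(car_split)
# 2. create a recipe (here the formula assigns the variable roles)
car_rec <- recipe(mpg ~ ., data = car_train)
# 3. specify model, engine, and mode
car_mod <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")
# 4. create a workflow, add recipe & model
car_wf <- workflow() %>%
  add_recipe(car_rec) %>%
  add_model(car_mod)
# 5. fit the workflow
car_fit <- fit(car_wf, data = car_train)
# 6. get predictions for the held-out test set
car_pred <- predict(car_fit, new_data = car_test) %>%
  bind_cols(car_test)
# 7. use predictions to get performance metrics
car_pred %>% rmse(truth = mpg, estimate = .pred)
```

Don’t worry if some of these functions are unfamiliar - each one is introduced in its own section below.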
26.2 Split the data
A good solution to this issue is to split our data set into a training and a testing data set. The training data set will be used to build and tune our model; the testing data set then lets us determine how well our model describes the relationship between outcome and predictor values.
Once we have created our testing data set we will set that aside until we have completed optimizing our model with the training set to minimize the bias in evaluating the performance of our model.
We will use rsample::initial_split() to randomly subset our data into a training (2/3 of observations) and a test (1/3 of observations) data set.
# split sample
pm_split <- rsample::initial_split(data = pm, prop = 2/3)
# check proportions of split
pm_split
<Training/Testing/Total>
<584/292/876>
Next, we will extract the training and testing data to create two separate data.frames.
# training data set
train_pm <- training(pm_split)
# test data set
test_pm <- testing(pm_split)
26.3 Prepare for pre-processing
Now that we have split the data, we need to process the training and testing data to make sure they are compatible and optimized for building the model. This process is called feature engineering and involves assigning variables to specific roles, removing redundant variables if they are present, and scaling data as needed.
26.3.1 Step 1: Specify variable roles with recipe()
In the tidymodels framework we can do this by first creating a “recipe” describing all the individual processing steps we want to take - this is especially helpful if we have multiple data sets we are going to work with and/or if we are going to be re-running the processing multiple times.
First, we will create a recipe that specifies the roles of individual variables, i.e. which are the outcome and which the predictor variables4.
4 Keep in mind that the recipe just describes the steps to take; it does not actually execute them
For our data set we want to specify that value (the PM2.5 concentration measured by air pollution monitors) is our outcome variable, and that our features (predictor variables) are all the other variables with the exception of the monitor ID (id). We don’t want to include id as a predictor variable because it is unique to each monitor, so it would only add noise - however, down the line we will still want to have this information, so we want to keep it in the data set.
# build recipe
simple_rec <- recipe(train_pm) %>%
update_role(everything(), new_role = "predictor") %>% # specify predictor variables
update_role(value, new_role = "outcome") %>% # specify outcome variable
update_role(id, new_role = "id variable") # specify id as id variable
simple_rec
Let’s take a closer look at our recipe:
summary(simple_rec) %>%
kable()
| variable | type | role | source |
|---|---|---|---|
| id | factor , unordered, nominal | id variable | original |
| value | double , numeric | outcome | original |
| fips | factor , unordered, nominal | predictor | original |
| lat | double , numeric | predictor | original |
| lon | double , numeric | predictor | original |
| state | string , unordered, nominal | predictor | original |
| county | string , unordered, nominal | predictor | original |
| city | string , unordered, nominal | predictor | original |
| cmaq | double , numeric | predictor | original |
| zcta | factor , unordered, nominal | predictor | original |
| zcta_area | double , numeric | predictor | original |
| zcta_pop | double , numeric | predictor | original |
| imp_a500 | double , numeric | predictor | original |
| imp_a1000 | double , numeric | predictor | original |
| imp_a5000 | double , numeric | predictor | original |
| imp_a10000 | double , numeric | predictor | original |
| imp_a15000 | double , numeric | predictor | original |
| county_area | double , numeric | predictor | original |
| county_pop | double , numeric | predictor | original |
| log_dist_to_prisec | double , numeric | predictor | original |
| log_pri_length_5000 | double , numeric | predictor | original |
| log_pri_length_10000 | double , numeric | predictor | original |
| log_pri_length_15000 | double , numeric | predictor | original |
| log_pri_length_25000 | double , numeric | predictor | original |
| log_prisec_length_500 | double , numeric | predictor | original |
| log_prisec_length_1000 | double , numeric | predictor | original |
| log_prisec_length_5000 | double , numeric | predictor | original |
| log_prisec_length_10000 | double , numeric | predictor | original |
| log_prisec_length_15000 | double , numeric | predictor | original |
| log_prisec_length_25000 | double , numeric | predictor | original |
| log_nei_2008_pm25_sum_10000 | double , numeric | predictor | original |
| log_nei_2008_pm25_sum_15000 | double , numeric | predictor | original |
| log_nei_2008_pm25_sum_25000 | double , numeric | predictor | original |
| log_nei_2008_pm10_sum_10000 | double , numeric | predictor | original |
| log_nei_2008_pm10_sum_15000 | double , numeric | predictor | original |
| log_nei_2008_pm10_sum_25000 | double , numeric | predictor | original |
| popdens_county | double , numeric | predictor | original |
| popdens_zcta | double , numeric | predictor | original |
| nohs | double , numeric | predictor | original |
| somehs | double , numeric | predictor | original |
| hs | double , numeric | predictor | original |
| somecollege | double , numeric | predictor | original |
| associate | double , numeric | predictor | original |
| bachelor | double , numeric | predictor | original |
| grad | double , numeric | predictor | original |
| pov | double , numeric | predictor | original |
| hs_orless | double , numeric | predictor | original |
| urc2013 | double , numeric | predictor | original |
| urc2006 | double , numeric | predictor | original |
| aod | double , numeric | predictor | original |
Success! All of our variables now have a specific role: outcome variable, predictor (feature), or id column.
26.3.2 Step 2: Specify pre-processing steps
Our next step is to use the recipes::step_*() functions to specify any necessary pre-processing steps. This could include a variety of transformations of our data, for example filling in missing values (imputation), converting continuous variables into discrete variables (binning them), encoding and creating dummy variables, data type conversions, or normalization.
Because we are in the extended tidyverse we can use various functions to help select the variables to apply steps to:
- Use tidyselect methods such as contains(), matches(), starts_with(), ends_with(), everything(), num_range().
- Use the data type of a column, e.g. all_nominal(), all_numeric(), has_type().
- Use the role assigned to a variable (see above), e.g. all_predictors(), all_outcomes(), has_role().
- Use the name of the variable directly.
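For illustration, here is how a few of these selectors could be combined in recipe steps. These particular steps (normalizing, log-transforming, dummy-coding) are hypothetical examples, not part of the recipe we will actually build:

```r
# hypothetical examples of selector helpers inside recipe steps
simple_rec %>%
  step_normalize(all_numeric_predictors()) %>% # select by role and type
  step_log(starts_with("county_")) %>%         # select by name pattern
  step_dummy(all_nominal_predictors())         # select by data type
```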
Let’s look at a couple of specific examples for what we need to pay attention to during preprocessing.
A typical pre-processing step is one-hot encoding, which is a way of converting categorical variables to dummy variables (numbers) so that they can be used with algorithms that only take certain data types as input. Because we don’t want the algorithm to interpret these variables as continuous numerical variables, we make it explicit that they are binary (0s and 1s, no order).
simple_rec %>%
step_dummy(state, county, city, zcta, one_hot = TRUE)
The fips column contains a numeric code for state and county, so it is redundant - we should also assign it as an id variable.
simple_rec %>%
update_role("fips", new_role = "county id")
We know from our exploratory analysis that we have a series of variables that are redundant and/or highly correlated with each other; we will want to remove these, which can be done using step_corr(). However, we want to keep some variables (cmaq, aod), so we will explicitly exclude them from being removed.
simple_rec %>%
step_corr(all_predictors(), -cmaq, -aod)
Variables with near zero variance will not be informative and will likely only add noise, so we also want to remove those.
simple_rec %>%
step_nzv(all_predictors(), -cmaq, -aod)
The benefit of the recipes package is that we can create one single recipe that summarizes all the steps we want to take before starting to build our model.
# create final recipe
simple_rec <- recipe(train_pm) %>%
update_role(everything(), new_role = "predictor") %>%
update_role(value, new_role = "outcome") %>%
update_role(id, new_role = "id variable") %>%
update_role("fips", new_role = "county id") %>%
step_dummy(state, county, city, zcta, one_hot = TRUE) %>%
step_corr(all_numeric()) %>%
step_nzv(all_numeric())
simple_rec
26.4 Run pre-processing
So far we only have a recipe5. Our next step will be to complete the pre-processing and see how it affects our data set.
5 Think of this as a plan on how we want to pre-process our data
26.4.1 Step 1: Update the recipe with training data using prep()
The function prep() will update the recipe object based on the training data by estimating parameters for pre-processing and updating the variable roles.
# update recipe with training data, retain training data set
prepped_rec <- prep(simple_rec, verbose = TRUE,
retain = TRUE)
oper 1 step dummy [training]
oper 2 step corr [training]
oper 3 step nzv [training]
The retained training set is ~ 0.27 Mb in memory.
names(prepped_rec)
 [1] "var_info" "term_info" "steps" "template"
[5] "levels" "retained" "requirements" "tr_info"
[9] "orig_lvls" "last_term_info"
prepped_rec is a list; the various elements contain a lot of useful information about our training set.
- steps: contains the pre-processing steps that were run
- var_info: contains the original variable information
- term_info: contains the updated variable information after pre-processing
- levels: contains the new levels; the original levels are in orig_lvls
- tr_info: contains info about the training data set size and completeness
26.4.2 Step 2: Extract the pre-processed training data set using bake()
We retained our pre-processed training data set using retain = TRUE so we can take a look at our training data using recipes::bake().
# extract training data set
baked_train <- bake(prepped_rec, new_data = NULL)
# overview
glimpse(baked_train)
Rows: 584
Columns: 39
$ id <fct> 36001.0012, 19045.0019, 8069.0009, 6029.00…
$ value <dbl> 8.225743, 12.156484, 6.679167, 21.201299, …
$ fips <fct> 36001, 19045, 8069, 6029, 13153, 8013, 390…
$ lat <dbl> 42.68075, 41.82328, 40.57129, 35.38557, 32…
$ lon <dbl> -73.75733, -90.21198, -105.07969, -119.015…
$ cmaq <dbl> 6.934664, 8.476871, 3.848796, 5.747609, 11…
$ zcta_area <dbl> 20537352, 287014236, 36755388, 10828653, 4…
$ zcta_pop <dbl> 11303, 28293, 33409, 12248, 2888, 6578, 19…
$ imp_a500 <dbl> 25.23788927, 0.00000000, 28.98961938, 54.6…
$ imp_a15000 <dbl> 13.2312860, 2.8746265, 5.9575519, 19.12959…
$ county_area <dbl> 1354056384, 1799821282, 6723613486, 210615…
$ county_pop <dbl> 304204, 49116, 299630, 839631, 139900, 294…
$ log_dist_to_prisec <dbl> 4.645390, 7.393982, 6.318517, 5.811309, 6.…
$ log_pri_length_5000 <dbl> 10.720038, 8.517193, 8.517193, 8.518782, 8…
$ log_pri_length_25000 <dbl> 13.10303, 12.14475, 12.27618, 11.00095, 13…
$ log_prisec_length_500 <dbl> 8.320712, 6.214608, 6.214608, 7.931814, 6.…
$ log_prisec_length_1000 <dbl> 9.652363, 7.600902, 8.783157, 9.415249, 8.…
$ log_prisec_length_5000 <dbl> 12.022923, 11.109942, 10.758653, 11.446320…
$ log_prisec_length_10000 <dbl> 13.36628, 12.00213, 12.22083, 12.08351, 11…
$ log_nei_2008_pm10_sum_10000 <dbl> 5.027644654, 7.022428553, 5.388681313, 5.8…
$ log_nei_2008_pm10_sum_15000 <dbl> 5.83021624, 7.02829583, 5.63414841, 6.0111…
$ log_nei_2008_pm10_sum_25000 <dbl> 6.7434101, 7.0489046, 6.6266994, 6.1962012…
$ popdens_county <dbl> 224.661250, 27.289376, 44.563835, 39.86555…
$ popdens_zcta <dbl> 550.363065, 98.576992, 908.955171, 1131.07…
$ nohs <dbl> 4.3, 4.3, 1.7, 12.5, 0.8, 0.0, 3.9, 4.3, 1…
$ somehs <dbl> 9.1, 8.1, 5.0, 17.0, 7.4, 0.0, 10.2, 12.8,…
$ hs <dbl> 23.2, 38.2, 14.9, 28.6, 14.4, 0.0, 46.6, 3…
$ somecollege <dbl> 14.6, 22.7, 21.7, 20.4, 40.6, 0.0, 21.0, 2…
$ associate <dbl> 4.6, 11.4, 7.7, 8.5, 12.7, 2.1, 8.0, 7.5, …
$ bachelor <dbl> 21.3, 11.2, 29.7, 9.1, 16.5, 85.4, 7.0, 9.…
$ grad <dbl> 22.9, 4.1, 19.3, 3.9, 7.6, 12.5, 3.4, 3.7,…
$ pov <dbl> 27.000000, 10.100000, 9.500000, 31.400000,…
$ hs_orless <dbl> 36.6, 50.6, 21.6, 58.1, 22.6, 0.0, 60.7, 5…
$ urc2013 <dbl> 3, 5, 3, 3, 4, 3, 1, 6, 1, 4, 2, 4, 4, 5, …
$ aod <dbl> 42.62500, 48.55556, 29.33333, 62.50000, 35…
$ state_California <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …
$ state_Illinois <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ state_Ohio <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, …
$ city_Not.in.a.city <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, …
Compare this to our original data set
glimpse(pm)
Rows: 876
Columns: 50
$ id <fct> 1003.001, 1027.0001, 1033.1002, 1049.1003,…
$ value <dbl> 9.597647, 10.800000, 11.212174, 11.659091,…
$ fips <fct> 1003, 1027, 1033, 1049, 1055, 1069, 1073, …
$ lat <dbl> 30.49800, 33.28126, 34.75878, 34.28763, 33…
$ lon <dbl> -87.88141, -85.80218, -87.65056, -85.96830…
$ state <chr> "Alabama", "Alabama", "Alabama", "Alabama"…
$ county <chr> "Baldwin", "Clay", "Colbert", "DeKalb", "E…
$ city <chr> "In a city", "In a city", "In a city", "In…
$ cmaq <dbl> 8.098836, 9.766208, 9.402679, 8.534772, 9.…
$ zcta <fct> 36532, 36251, 35660, 35962, 35901, 36303, …
$ zcta_area <dbl> 190980522, 374132430, 16716984, 203836235,…
$ zcta_pop <dbl> 27829, 5103, 9042, 8300, 20045, 30217, 901…
$ imp_a500 <dbl> 0.01730104, 1.96972318, 19.17301038, 5.782…
$ imp_a1000 <dbl> 1.4096021, 0.8531574, 11.1448962, 3.867647…
$ imp_a5000 <dbl> 3.3360118, 0.9851479, 15.1786154, 1.231141…
$ imp_a10000 <dbl> 1.9879187, 0.5208189, 9.7253870, 1.0316469…
$ imp_a15000 <dbl> 1.4386207, 0.3359198, 5.2472094, 0.9730444…
$ county_area <dbl> 4117521611, 1564252280, 1534877333, 201266…
$ county_pop <dbl> 182265, 13932, 54428, 71109, 104430, 10154…
$ log_dist_to_prisec <dbl> 4.648181, 7.219907, 5.760131, 3.721489, 5.…
$ log_pri_length_5000 <dbl> 8.517193, 8.517193, 8.517193, 8.517193, 9.…
$ log_pri_length_10000 <dbl> 9.210340, 9.210340, 9.274303, 10.409411, 1…
$ log_pri_length_15000 <dbl> 9.630228, 9.615805, 9.658899, 11.173626, 1…
$ log_pri_length_25000 <dbl> 11.32735, 10.12663, 10.15769, 11.90959, 12…
$ log_prisec_length_500 <dbl> 7.295356, 6.214608, 8.611945, 7.310155, 8.…
$ log_prisec_length_1000 <dbl> 8.195119, 7.600902, 9.735569, 8.585843, 9.…
$ log_prisec_length_5000 <dbl> 10.815042, 10.170878, 11.770407, 10.214200…
$ log_prisec_length_10000 <dbl> 11.88680, 11.40554, 12.84066, 11.50894, 12…
$ log_prisec_length_15000 <dbl> 12.205723, 12.042963, 13.282656, 12.353663…
$ log_prisec_length_25000 <dbl> 13.41395, 12.79980, 13.79973, 13.55979, 13…
$ log_nei_2008_pm25_sum_10000 <dbl> 0.318035438, 3.218632928, 6.573127301, 0.0…
$ log_nei_2008_pm25_sum_15000 <dbl> 1.967358961, 3.218632928, 6.581917457, 3.2…
$ log_nei_2008_pm25_sum_25000 <dbl> 5.067308, 3.218633, 6.875900, 4.887665, 4.…
$ log_nei_2008_pm10_sum_10000 <dbl> 1.35588511, 3.31111648, 6.69187313, 0.0000…
$ log_nei_2008_pm10_sum_15000 <dbl> 2.26783411, 3.31111648, 6.70127741, 3.3500…
$ log_nei_2008_pm10_sum_25000 <dbl> 5.628728, 3.311116, 7.148858, 5.171920, 4.…
$ popdens_county <dbl> 44.265706, 8.906492, 35.460814, 35.330814,…
$ popdens_zcta <dbl> 145.716431, 13.639555, 540.887040, 40.7189…
$ nohs <dbl> 3.3, 11.6, 7.3, 14.3, 4.3, 5.8, 7.1, 2.7, …
$ somehs <dbl> 4.9, 19.1, 15.8, 16.7, 13.3, 11.6, 17.1, 6…
$ hs <dbl> 25.1, 33.9, 30.6, 35.0, 27.8, 29.8, 37.2, …
$ somecollege <dbl> 19.7, 18.8, 20.9, 14.9, 29.2, 21.4, 23.5, …
$ associate <dbl> 8.2, 8.0, 7.6, 5.5, 10.1, 7.9, 7.3, 8.0, 4…
$ bachelor <dbl> 25.3, 5.5, 12.7, 7.9, 10.0, 13.7, 5.9, 17.…
$ grad <dbl> 13.5, 3.1, 5.1, 5.8, 5.4, 9.8, 2.0, 8.7, 2…
$ pov <dbl> 6.1, 19.5, 19.0, 13.8, 8.8, 15.6, 25.5, 7.…
$ hs_orless <dbl> 33.3, 64.6, 53.7, 66.0, 45.4, 47.2, 61.4, …
$ urc2013 <dbl> 4, 6, 4, 6, 4, 4, 1, 1, 1, 1, 1, 1, 1, 2, …
$ urc2006 <dbl> 5, 6, 4, 5, 4, 4, 1, 1, 1, 1, 1, 1, 1, 2, …
$ aod <dbl> 37.36364, 34.81818, 36.00000, 33.08333, 43…
26.4.3 Step 3: Extract the pre-processed testing data using bake().
We’ve extracted the training data set but we should also apply the recipe we designed to our test data set to check for any issues (e.g. introduction of NA values).
# extract test data set
baked_test_pm <- bake(prepped_rec, new_data = test_pm)
# compare effect
glimpse(baked_test_pm)
Rows: 292
Columns: 39
$ id <fct> 1003.001, 1049.1003, 1055.001, 1069.0003, …
$ value <dbl> 9.597647, 11.659091, 12.375394, 10.508850,…
$ fips <fct> 1003, 1049, 1055, 1069, 1073, 1089, 1097, …
$ lat <dbl> 30.49800, 34.28763, 33.99375, 31.22636, 33…
$ lon <dbl> -87.88141, -85.96830, -85.99107, -85.39077…
$ cmaq <dbl> 8.098836, 8.534772, 9.241744, 9.121892, 9.…
$ zcta_area <dbl> 190980522, 203836235, 154069359, 162685124…
$ zcta_pop <dbl> 27829, 8300, 20045, 30217, 21725, 21297, 2…
$ imp_a500 <dbl> 0.01730104, 5.78200692, 16.49307958, 19.13…
$ imp_a15000 <dbl> 1.4386207, 0.9730444, 5.1612102, 4.7401296…
$ county_area <dbl> 4117521611, 2012662359, 1385618994, 150173…
$ county_pop <dbl> 182265, 71109, 104430, 101547, 658466, 334…
$ log_dist_to_prisec <dbl> 4.648181, 3.721489, 5.261457, 7.112373, 4.…
$ log_pri_length_5000 <dbl> 8.517193, 8.517193, 9.066563, 8.517193, 8.…
$ log_pri_length_25000 <dbl> 11.32735, 11.90959, 12.01356, 10.12663, 12…
$ log_prisec_length_500 <dbl> 7.295356, 7.310155, 8.740680, 6.214608, 8.…
$ log_prisec_length_1000 <dbl> 8.195119, 8.585843, 9.627898, 7.600902, 9.…
$ log_prisec_length_5000 <dbl> 10.815042, 10.214200, 11.728889, 12.298627…
$ log_prisec_length_10000 <dbl> 11.88680, 11.50894, 12.76828, 12.99414, 11…
$ log_nei_2008_pm10_sum_10000 <dbl> 1.35588511, 0.00000000, 4.43719884, 0.9288…
$ log_nei_2008_pm10_sum_15000 <dbl> 2.267834, 3.350044, 4.462679, 3.674739, 5.…
$ log_nei_2008_pm10_sum_25000 <dbl> 5.628728, 5.171920, 4.678311, 3.744629, 8.…
$ popdens_county <dbl> 44.265706, 35.330814, 75.367038, 67.619664…
$ popdens_zcta <dbl> 145.71643, 40.71896, 130.10374, 185.73917,…
$ nohs <dbl> 3.3, 14.3, 4.3, 5.8, 2.8, 1.2, 5.3, 23.9, …
$ somehs <dbl> 4.9, 16.7, 13.3, 11.6, 8.2, 3.1, 14.2, 17.…
$ hs <dbl> 25.1, 35.0, 27.8, 29.8, 32.0, 15.1, 35.7, …
$ somecollege <dbl> 19.7, 14.9, 29.2, 21.4, 28.9, 20.5, 23.8, …
$ associate <dbl> 8.2, 5.5, 10.1, 7.9, 7.4, 6.5, 8.3, 4.3, 5…
$ bachelor <dbl> 25.3, 7.9, 10.0, 13.7, 13.5, 30.4, 8.3, 4.…
$ grad <dbl> 13.5, 5.8, 5.4, 9.8, 7.1, 23.3, 4.5, 2.0, …
$ pov <dbl> 6.1, 13.8, 8.8, 15.6, 5.1, 5.2, 18.4, 29.5…
$ hs_orless <dbl> 33.3, 66.0, 45.4, 47.2, 43.0, 19.4, 55.2, …
$ urc2013 <dbl> 4, 6, 4, 4, 1, 3, 3, 1, 1, 3, 2, 2, 6, 2, …
$ aod <dbl> 37.36364, 33.08333, 43.41667, 33.00000, 36…
$ state_California <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ state_Illinois <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ state_Ohio <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ city_Not.in.a.city <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
26.5 Specify the model
Quick reminder of where we are at in the process - here are the steps we’ve taken so far:
- acquire a data set with an outcome variable and a set of predictor variables
- split the data set into training and testing sets using the rsample package
- assign variable roles, specify and prep pre-processing, and extract our pre-processed data sets using the recipes package
Now it’s time to actually specify our model. The tidymodels package to do this is called parsnip.
There are four things we need to specify about our model
- The model type, e.g. a linear regression as we have done previously (in the next chapter we will use a random forest approach).
- The engine (underlying function/package) to implement the selected model type, e.g. previously we have used lm as our engine to perform a linear regression.
- The mode of learning; so far we’ve never explicitly specified this. Typically this is either classification or regression.
- Additional arguments specific to the specified model or package.
26.5.1 Step 1: Specify the model type
Let’s start by specifying the model type as a linear regression
lm_PM <- linear_reg() # specify model type
26.5.2 Step 2: Specify the engine
We want to use the ordinary least squares method to fit our linear regression. There are multiple implementations in various packages, so we need to tell parsnip exactly which function/package to use.
lm_PM <- linear_reg() %>% # specify model type
set_engine("lm") # set engine
26.5.3 Step 3: Specify mode of learning
Some engines can do both classification and regression, so we should explicitly specify the mode; in this case we want to perform a regression.
lm_PM <- linear_reg() %>% # specify model type
set_engine("lm") %>% # set engine
set_mode("regression") # set mode
26.6 Define workflow & Fit the model
Now we’re ready to actually fit the model.
We are going to use the workflows package to keep track of the pre-processing steps and model specification. Down the line this will help us during the optimization process because we will be able to automate steps, and it will be straightforward to add post-processing operations.
Let’s create our workflow which will incorporate our recipe for pre-processing and the model we just specified6.
6 We did use prep() to take a look at our data set during pre-processing, but it actually is not a necessary step
PM_wflow <- workflows::workflow() %>%
workflows::add_recipe(simple_rec) %>%
workflows::add_model(lm_PM)
We can call up the workflow to get an overview of our model fitting process, including our pre-processing steps and model specification:
PM_wflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()
── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps
• step_dummy()
• step_corr()
• step_nzv()
── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)
Computational engine: lm
After all of that, we are now ready to “prepare the recipe”, i.e. we will estimate the parameters to fit the model to our full training data set using parsnip::fit().
PM_wflow_fit <- fit(PM_wflow, data = train_pm)
26.7 Assessing the model
Let’s take a look at our fitted model7.
7 Because we used a workflow, we will first have to extract the fitted model from the workflow and then we can use broom::tidy() to look at the summary table you are already familiar with
wflowoutput <- PM_wflow_fit %>%
pull_workflow_fit() %>%
broom::tidy()
wflowoutput %>%
kable()
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 182.1641658 | 115.6566032 | 1.5750434 | 0.1158235 |
| lat | 0.0364521 | 0.0234493 | 1.5545071 | 0.1206408 |
| lon | 0.0318298 | 0.0093350 | 3.4097334 | 0.0006981 |
| cmaq | 0.2570252 | 0.0422701 | 6.0805464 | 0.0000000 |
| zcta_area | 0.0000000 | 0.0000000 | -1.3231225 | 0.1863465 |
| zcta_pop | 0.0000089 | 0.0000052 | 1.7113227 | 0.0875874 |
| imp_a500 | 0.0081746 | 0.0070606 | 1.1577786 | 0.2474588 |
| imp_a15000 | -0.0033947 | 0.0117492 | -0.2889270 | 0.7727465 |
| county_area | 0.0000000 | 0.0000000 | -0.8312412 | 0.4061994 |
| county_pop | -0.0000001 | 0.0000001 | -0.7799422 | 0.4357616 |
| log_dist_to_prisec | 0.1481777 | 0.1056331 | 1.4027586 | 0.1612551 |
| log_pri_length_5000 | -0.2427120 | 0.1274198 | -1.9048225 | 0.0573264 |
| log_pri_length_25000 | 0.2344382 | 0.1457027 | 1.6090176 | 0.1081884 |
| log_prisec_length_500 | 0.2806522 | 0.1613107 | 1.7398243 | 0.0824513 |
| log_prisec_length_1000 | -0.1065166 | 0.1995609 | -0.5337547 | 0.5937277 |
| log_prisec_length_5000 | 0.6436848 | 0.3020454 | 2.1310865 | 0.0335264 |
| log_prisec_length_10000 | -0.1054887 | 0.3243774 | -0.3252035 | 0.7451512 |
| log_nei_2008_pm10_sum_10000 | 0.0685339 | 0.0790329 | 0.8671565 | 0.3862358 |
| log_nei_2008_pm10_sum_15000 | 0.0010473 | 0.1041524 | 0.0100555 | 0.9919807 |
| log_nei_2008_pm10_sum_25000 | 0.1604706 | 0.0810542 | 1.9797943 | 0.0482265 |
| popdens_county | 0.0000165 | 0.0000625 | 0.2648908 | 0.7911932 |
| popdens_zcta | -0.0000174 | 0.0000472 | -0.3674112 | 0.7134542 |
| nohs | -1.8426392 | 1.1563849 | -1.5934481 | 0.1116363 |
| somehs | -1.8414708 | 1.1572149 | -1.5912955 | 0.1121198 |
| hs | -1.8180085 | 1.1559237 | -1.5727755 | 0.1163480 |
| somecollege | -1.8373995 | 1.1559603 | -1.5895005 | 0.1125242 |
| associate | -1.9058983 | 1.1580924 | -1.6457222 | 0.1003943 |
| bachelor | -1.8482081 | 1.1558335 | -1.5990262 | 0.1103911 |
| grad | -1.8463020 | 1.1565096 | -1.5964433 | 0.1109663 |
| pov | -0.0007694 | 0.0103864 | -0.0740799 | 0.9409739 |
| hs_orless | NA | NA | NA | NA |
| urc2013 | 0.2654620 | 0.0904887 | 2.9336476 | 0.0034903 |
| aod | 0.0202479 | 0.0054056 | 3.7457080 | 0.0001989 |
| state_California | 3.3887909 | 0.3955334 | 8.5676472 | 0.0000000 |
| state_Illinois | -0.2502806 | 0.4006262 | -0.6247236 | 0.5324125 |
| state_Ohio | 0.6901930 | 0.3743078 | 1.8439185 | 0.0657347 |
| city_Not.in.a.city | 0.1167744 | 0.3062065 | 0.3813584 | 0.7030851 |
With models that have this many predictor variables it can be helpful to understand which are the most important (i.e. which have the strongest predictive value).
The function vip::vip() creates a barplot of the variable importance scores for each predictor variable, ordered from most to least important.
# pull top 10 most important variables
PM_wflow_fit %>%
pull_workflow_fit() %>%
vip(num_features = 10)
26.8 Evaluate model performance
Now that we have a fitted model we need to determine how well our model actually performs. The way to do this is to compare the similarity of predicted estimates and observed values of the outcome variable8.
8 Recall that a machine learning optimization problem tries to minimize the distance between the predicted outcome \(\hat{Y} = f(X)\) and the actual outcome \(Y\), using the predictor variables \(X\) as input for the function \(f\) that we want to estimate.
We can use the function broom::augment() to pull the observed and fitted values from our workflow.
fit_PM <- PM_wflow_fit %>%
pull_workflow_fit()
fit_PM <- augment(fit_PM$fit, data = baked_train) %>%
select(value, .fitted:.std.resid)
fit_PM
# A tibble: 584 × 6
value .fitted .hat .sigma .cooksd .std.resid
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 8.23 11.1 0.0279 1.95 0.00182 -1.51
2 12.2 11.8 0.0317 1.95 0.0000407 0.212
3 6.68 8.18 0.0370 1.95 0.000657 -0.784
4 21.2 12.8 0.0523 1.92 0.0303 4.44
5 11.7 10.9 0.0678 1.95 0.000387 0.437
6 6.49 8.24 0.207 1.95 0.00736 -1.01
7 11.9 12.7 0.0532 1.95 0.000238 -0.391
8 10.5 8.80 0.0458 1.95 0.00109 0.904
9 8.64 10.4 0.0536 1.95 0.00138 -0.937
10 11.2 12.7 0.0410 1.95 0.000769 -0.805
# ℹ 574 more rows
Let’s compare the fitted (predicted) outcome values (fitted values) \(\hat{Y}\) to the observed outcome values \(Y\).
ggplot(fit_PM, aes(x = value, y = .fitted)) +
geom_point() +
geom_abline(slope = 1) +
coord_fixed(ratio = 1) +
scale_y_continuous(limits = c(5, 20)) +
labs(x = "observed outcome values", y = "predicted outcome values")
Our range of predicted outcomes appears to be smaller than the range of the actual observed values.
We can quantify our model performance using the root mean square error (RMSE)9.
9 Recall that this is the square root of the mean of the squared differences between observed and fitted values: \(RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}({\hat{y}_{i}-y_{i}})^{2}}\)
We can calculate RMSE using yardstick::rmse():
# calculate rmse
fit_PM %>%
rmse(truth = value, estimate = .fitted)
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 1.89
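As a sanity check, we can compute the same quantity by hand from the observed and fitted values (rmse_manual is just an illustrative column name):

```r
# compute RMSE manually: square the residuals, average them, take the square root
fit_PM %>%
  summarize(rmse_manual = sqrt(mean((value - .fitted)^2)))
```

This should reproduce the value reported by yardstick::rmse() above.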
26.9 Cross validation
We previously realized that without context RMSE is not that helpful: the smaller the value the better - but small compared to what? One option would be to check whether the RMSE for our training data set is comparable to that for our test data set, which would indicate that the model works similarly well for both. However, we only get to use our test data set once - at the very end, after we have fit and tuned all the parameters for our model - which means we don’t get to use it during the optimization process.
This poses a problem that we solve using the process of cross-validation. To do this we further split our training data set into multiple data sets for a better assessment and optimization of our model before turning to our test data set.
While there are several methods for cross-validation, we will use k-fold (or v-fold) cross-validation, which is straightforward but effective. To do this, we split our data into \(k\) equally sized subsets (folds). Then we fit the model on all but one of these subsets and assess its performance using the subset that was left out (the holdout set). We repeat this process until every subset has been left out once.
We are going to ignore the spatial dependence of our data set and randomly subset our training data set into four cross-validation folds. A more involved version of this would leave out blocks of monitors based on geography to test whether geography impacts the performance of the model.
We can use rsample::vfold_cv() to create the cross-validation folds10.
10 We will create 4 folds; this is low, typically you would use 10
# create four subsets
kfold_pm <- rsample::vfold_cv(data = train_pm, v = 4)
kfold_pm
# 4-fold cross-validation
# A tibble: 4 × 2
splits id
<list> <chr>
1 <split [438/146]> Fold1
2 <split [438/146]> Fold2
3 <split [438/146]> Fold3
4 <split [438/146]> Fold4
We can use tune::fit_resamples() to perform the cross-validation assessment. This automates the process: it will first use folds 1-3 to fit the model and then assess the performance using fold 4; next it will use folds 1, 2, and 4 to fit the model and fold 3 for assessment; and so on. There are always as many iterations as there are folds.
xVal <- fit_resamples(PM_wflow, kfold_pm)
Our next step would be to take a look at a set of performance metrics based on the cross-validation assessment. For example, tune::show_best() will calculate the mean RMSE value across all four folds.
show_best(xVal, metric = "rmse")
# A tibble: 1 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 rmse standard 1.99 4 0.0695 Preprocessor1_Model1
We just built a predictive model using a linear regression. This is an example of what we generally refer to as a supervised machine learning model11. You have probably primarily used linear regression as a statistical model, i.e. to make inferences about the population as a whole and to understand relationships between dependent and independent variables. But it can also be used for predictive modeling, where we train the model based on features. However, the more predictor variables are included, the more difficult it becomes to estimate the coefficients.
11 Supervised models use labeled data to “supervise” the building of the model.
Random forest is a decision-tree-based method that can also be used for regression analysis; it is also a supervised ML model. We will explore what this looks like in the next chapter. You will quickly see that the individual steps and even most of the code are the same, with the main difference being that we define a different model and engine.